Abstract

Standard procedures for drawing inferences from complex samples do not apply when the variables of interest z are not observed directly, but must be inferred from secondary random variables x that depend on z stochastically. Employing Rubin's (1977) approach to missing data in survey research, we present a procedure by which reasonable inferences can be made in such situations. The key is to represent knowledge about latent variables in the form of a predictive distribution, conditional on manifest variables. It is then possible to obtain the expectations of statistics that would have been computed if the values of the latent variables corresponding to sampled units were known, along with variance estimators that account for uncertainty due to both subject sampling and the latency of z.

Key words: EM algorithm
           Incomplete data
           Latent structure
           Multiple imputation procedures
           Sampling designs
           Superpopulation models

Inferences about Latent Populations from Complex Samples*

Introduction 

While progress has been made in recent years in estimating latent distributions (e.g., Andersen & Madsen, 1977; Dempster, Laird, & Rubin, 1977; Laird, 1978; Mislevy, 1984, 1985; Sanathanan & Blumenthal, 1978), currently available procedures remain limited to simple random samples and are inaccessible to the typical secondary user of survey data.¹ This paper addresses the problem of estimating distributions under the conditions that (1) data have been gathered from a finite population under a complex sampling design and (2) one or more variables of interest are not observed directly, but must be inferred from responses which depend upon them stochastically (e.g., "ability" variables under an item response model).

Two basic approaches exist for handling uncertainty due to sampling in a finite population (see Cassel, Sarndal, & Wretman, 1977, for an overview). Under the "fixed population" or "randomization" approach, the only source of variation is the researcher's random selection of a sample in accordance with probabilities under a given sampling design. Inferences are based on the distribution of an estimator over the samples that can occur under that design. Under the "superpopulation" approach, the finite population itself is considered a sample from a hypothetical superpopulation. A structure is assumed for the superpopulation, its parameters are estimated from the sampled units, and inferences are drawn with respect to remaining uncertainty about nonsampled units.

*The author would like to thank R. Darrell Bock for calling his attention to the applicability of multiple imputation procedures to the assessment setting, and Henry Braun, Ben King, Paul Rosenbaum, and Don Rubin for comments on earlier drafts of this presentation.

Extension to the latent variable case is possible under both approaches. Attention is restricted here to the randomization approach, although it must be admitted that the unified treatment of uncertainty from all sources in a Bayesian superpopulation solution (e.g., Mislevy, 1985) is more satisfying. Given the overwhelming predominance of the randomization approach in applied work, however, there is clearly a place for a solution within its framework.

The key idea is to represent knowledge about latent variables in the form of a predictive distribution, conditional on manifest variables, in the manner suggested by Rubin (1977) as a way of handling missing responses in survey data. In a manner also suggested by Rubin (1978), this predictive distribution can be approximated numerically by repeated random draws. Standard complete-data procedures may then be employed to obtain the expected value of any statistic that would have been computed, had values of the latent variables been available. An accompanying variance estimator takes into account uncertainty due both to subject sampling and to the latency of the variables of interest.
Preliminaries

Consider a population of N identifiable units, indexed by i. Each is characterized by a pair of real-valued vectors $(z_i, y_i)$; values of z are unknown for all units before observations are taken, although values of some components of y may be known for all units (e.g., stratification variables). Z and Y will refer to the population matrices of these values. Interest lies in a function S = S(Z,Y) of the population values, but data will be obtained from only a sample of units. A sample design assigns probabilities p(d) of selection to members d of $\mathcal{D}$, the set of the $2^N$ possible subsets of the population, and may effect complexities such as stratification and clustering. Let D be the random variable indicating the units selected in the sample. Correspondingly, $(z_D, y_D)$ is a random variable and $(z_d, y_d)$ a generic value, representing values of z and y from the $n$ (or $n_d$) designated sample units. We shall restrict our attention to noninformative sample designs, or those for which Pr(D = d) does not depend on unknown values of Z or Y; i.e., letting $y_0$ represent the prior known components of y, we have

$$\Pr(D = d \mid z_d, y_d) = \Pr(D = d \mid y_0).$$

Assumption 1: The estimator $s_D = s(z_D, y_D)$ could be used to estimate S if $(z_D, y_D)$ were observed. We assume s to be unbiased, i.e., $E_D(s_D - S) = 0$, with variance $V = \mathrm{Var}_D(s_D - S)$, estimated by $\hat V = \hat V(z_D, y_D)$. A normal approximation is often employed in practice:

$$(s_D - S) \sim N(0, V).$$

Suppose that observations from sampled unit i consist not of $(z_i, y_i)$ but rather of $(x_i, y_i)$, where $x_i$ is a possibly multidimensional secondary random variable that depends stochastically upon $z_i$. An example would be the observation of right and wrong answers to test questions, assumed to depend upon a latent ability parameter in an item response model. We shall refer to unobserved variables z in the sequel as the latent variables, the observed variables y as collateral variables, and the observed variables x as item responses.

Assumption 2: Item responses x are governed by a model of known parametric form, characterized by possibly unknown parameters $\beta_1$. We assume conditional independence with respect to collateral variables and independence over units:

$$p(x \mid z, y; \beta_1) = p(x \mid z; \beta_1) = \prod_i p(x_i \mid z_i; \beta_1).$$
The General Solution

This section provides a general solution for estimating functions of variables in fixed populations, when observations are obtained from only a sample of units and values of one or more variables of interest are not directly observed. The solution proceeds in two stages. The first stage approximates conditional or predictive distributions of the latent variables corresponding to sample units; that is,

$$p(z_d \mid x_d, y_d).$$

The second stage obtains marginal distributions of statistics that would have been computed, had values of latent variables been available, conditional on observed values. Of particular interest as an estimator of S is the conditional expectation of s given $x_d$ and $y_d$:

$$s^*_d = s^*(x_d, y_d) = E[s(z_d, y_d) \mid x_d, y_d] = \int s(z_d, y_d)\, p(z_d \mid x_d, y_d)\, dz_d.$$

First, however, an additional assumption is required to compute the conditional distribution p(z|x,y):

Assumption 3: The distribution of latent variables given collateral variables, or $p(z \mid y; \beta_2)$, follows a known form, with possibly unknown parameters $\beta_2$. Furthermore, independence is assumed over units:

$$p(z \mid y; \beta_2) = \prod_i p(z_i \mid y_i; \beta_2).$$

This assumption resembles those used in superpopulation models for sampling from finite populations (e.g., Ericson, 1969; Royall, 1970).
Stage 1: Estimating Conditional Distributions

The task of stage 1 is to approximate the conditional density $p(z_d \mid x_d, y_d)$. Dropping the subscripts d on x and y, and denoting $(\beta_1, \beta_2)$ by $\beta$, note first that

$$p(z \mid x, y) = \int p(z \mid x, y; \beta)\, p(\beta \mid x, y)\, d\beta$$

$$= \int p(x \mid z, y; \beta)\, p(z \mid y; \beta)\, p^{-1}(x \mid y; \beta)\, p(\beta \mid x, y)\, d\beta \qquad \text{[Bayes theorem]}$$

$$= \int \prod_i p(x_i \mid z_i; \beta_1)\, p(z_i \mid y_i; \beta_2)\, p^{-1}(x \mid y; \beta)\, p(\beta \mid x, y)\, d\beta, \qquad (1) \qquad \text{[Assumptions 2 and 3]}$$

where

$$p(x \mid y; \beta) = \int \prod_i p(x_i \mid z_i; \beta_1)\, p(z_i \mid y_i; \beta_2)\, dz.$$

Now by Bayes theorem,

$$p(\beta \mid x, y) = p(x \mid y; \beta)\, p(\beta \mid y)\, p^{-1}(x \mid y),$$

where

$$p(x \mid y) = \iint p(x \mid z, y; \beta)\, p(z \mid y; \beta)\, p(\beta \mid y)\, dz\, d\beta;$$

this latter quantity does not depend on $\beta$, so we can write simply

$$p(\beta \mid x, y) = K \int \prod_i p(x_i \mid z_i; \beta_1)\, p(z_i \mid y_i; \beta_2)\, dz\; p(\beta \mid y) = K\, p(x \mid y; \beta)\, p(\beta \mid y). \qquad (2)$$

Substituting (2) into (1) and noting that $p(\beta \mid y) = p(\beta)$ by the noninformativity of the sampling design, we obtain

$$p(z \mid x, y) = K \int \prod_i p(x_i \mid z_i; \beta_1)\, p(z_i \mid y_i; \beta_2)\, p(\beta)\, d\beta. \qquad (3)$$

Stage 2: Estimating Marginal Distributions

The task of stage 2 is to obtain the expected value of s(z,y) given the observed data (x,y) and p(z|x,y) from stage 1. Define

$$s^*_d = s^*(x_d, y_d) = E[s(z_d, y_d) \mid x_d, y_d] = \int s(z_d, y_d)\, p(z_d \mid x_d, y_d)\, dz_d. \qquad (4)$$

In words, s* is the average of $s(z_d, y_d)$ over all possible values of $z_d$ for the sample, with each value weighted by its relative likelihood given the observations. To the extent that s is a reasonable estimator of S, then, so is s* in the latent variable case, since s* is the best quadratic-loss estimator of s, given $x_d$ and $y_d$.

The magnitude of the uncertainty in s* may be approximated along the line followed by Herzog and Rubin (1983). There are two sources of variation in s*. First there is variation due to sampling. By Assumption 1, [...]

$$s^*(x, y) \approx R^{-1} \sum_r s(z^*_r, y),$$

where

$$z^*_r = (z^*_{1r}, \ldots, z^*_{nr})$$

is a value selected at random from p(z|x,y). The sampling process is carried out R times to yield R replicate pseudo-data sets of the form $(z^*_r, y)$. The estimator s is evaluated with each replicate data set in turn, and the results are averaged to provide an estimate of s(z,y) and therefore of S(Z,Y).

Production of the replicate pseudo-data sets can be carried out in two steps. First, a value $\beta^*_r$ is selected at random from $p(\beta)$. Second, because the unit distributions $p(z_i \mid x_i, y_i; \beta)$ are independent conditional on $\beta$, a value $z^*_{ir}$ can be selected at random from $p(z_i \mid x_i, y_i; \beta = \beta^*_r)$ for each unit in the sample separately. When $\beta$ is well-determined by $x_d$ and $y_d$, the generation of pseudo-data sets with $\beta^*_r \equiv \hat\beta$, the maximum likelihood estimate of $\beta$, proves quite adequate.

[...]

$$U = \hat W_d + \hat V_d = R^{-1} \sum_r [s(z^*_r, y) - s^*(x, y)]^2 + R^{-1} \sum_r \hat V[s(z^*_r, y)]. \qquad (5)$$

Again in words, one approximates the variance of s* by the average of $\hat V(s)$ values for s calculated on the R pseudo-data sets, increased by the variance of the pseudo estimates of s. When $\hat V(s)$ is given by a resampling scheme such as jackknifing or balanced half replication, a less costly approximation for the sampling variance of s is $\hat V(s(z^*, y))$ as computed from one randomly selected pseudo-data set. These procedures will be recognized as a variation of "multiple imputation" procedures for missing data (Herzog & Rubin, 1983; Rubin, 1977, 1978), with latent variables considered 100-percent missing; that is, values are not observed from any respondent.

An important practical advantage of the multiple-imputation approach is that the same collection of pseudo-data sets can be used to estimate several different statistics S. A file containing R replicates would thus allow the secondary user to estimate, without additional special programming, any statistic he or she would have liked to calculate had z been observable, along with an indication of its precision that takes the latency of z into account.
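The averaging over pseudo-data sets and the variance approximation in (5) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's program: the function name `multiple_imputation_estimate` and its callback arguments (`draw_pseudo`, `estimator`, `var_estimator`) are hypothetical, the toy predictive distributions in the usage section are invented for illustration, and the between-replicate term uses the divisor R that appears in (5).

```python
import numpy as np


def multiple_imputation_estimate(draw_pseudo, estimator, var_estimator, R, rng):
    """Average a complete-data statistic over R pseudo-data sets.

    draw_pseudo(rng)  -> one vector z* drawn from p(z | x, y)  (hypothetical callback)
    estimator(z)      -> complete-data statistic s(z, y)
    var_estimator(z)  -> complete-data sampling variance estimate for s(z, y)
    Returns (s_star, total_variance) following the form of equation (5).
    """
    estimates = np.empty(R)
    variances = np.empty(R)
    for r in range(R):
        z_star = draw_pseudo(rng)
        estimates[r] = estimator(z_star)
        variances[r] = var_estimator(z_star)
    s_star = estimates.mean()
    between = np.mean((estimates - s_star) ** 2)  # spread of the pseudo estimates
    within = variances.mean()                     # average sampling variance
    return s_star, between + within


if __name__ == "__main__":
    # Toy predictive distributions (assumed for illustration only):
    # unit i has z_i | x_i, y_i ~ N(post_mean[i], post_sd[i]^2).
    post_mean = np.array([-0.2, 0.1, 0.4, 0.9, -0.6])
    post_sd = np.full(5, 0.6)
    s_star, total_var = multiple_imputation_estimate(
        lambda g: post_mean + post_sd * g.standard_normal(post_mean.size),
        np.mean,
        lambda z: z.var(ddof=1) / z.size,
        R=5,
        rng=np.random.default_rng(0),
    )
    print(s_star, total_var)
```

Because the estimator and its variance estimator are passed in as ordinary complete-data functions, the same set of draws can be reused for any other statistic, which is the practical advantage described above.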

A Numerical Example

This section applies the procedures outlined above to a small example with data from the Profile of American Youth (U.S. Department of Defense, 1982). For each respondent, the data consist of two demographic variables y (ethnicity and sex) and four responses x to items on an aptitude test, assumed to be governed by a single latent aptitude variable z. The item response model and conditional estimation results are taken from Mislevy (1985); the interested reader is referred to this source for additional detail. A simplified sampling design (though still more complex than simple random sampling) is assumed here for purposes of illustration.

The Data

The data we consider were obtained as part of the Profile of American Youth, a survey of the aptitudes of a national probability sample of Americans aged 16 through 23 in July, 1980. Table 1 presents counts of the sixteen possible response patterns to four items from the Arithmetic Reasoning subtest of the Armed Services Vocational Aptitude Battery (ASVAB), Form 8A, from samples of white males and females and Black males and females. A 1 denotes a correct response, while a 0 denotes an incorrect response. Though multiple stages of sampling were employed in the actual design of the study, we shall treat these four groups as a stratified random sample from a target population, with Blacks sampled at a rate double that of whites.



Insert Table 1 about here 



The Item Response Model

Let $x_{ij}$ represent the response of person i to item j. It is assumed that responses are governed by the three-parameter logistic item response model (Birnbaum, 1968), which gives the probability of a correct response as

$$P(x_{ij} = 1 \mid z_i, a_j, b_j, c_j) = P_{ij} = c_j + (1 - c_j)/\{1 + \exp[-1.7\, a_j (z_i - b_j)]\}$$

and the probability of an incorrect response as

$$P(x_{ij} = 0 \mid z_i, a_j, b_j, c_j) = 1 - P_{ij},$$

where $a_j$, $b_j$, and $c_j$ are parameters that characterize the regression of a correct response to item j on z. These parameters, over all four items, are denoted by $\beta_1$ in the general solution given above. Under the usual assumption of conditional independence, the probability of a vector of item responses from person i is given by

$$p(x_i \mid z_i; \beta_1) = \prod_j P_{ij}^{x_{ij}} (1 - P_{ij})^{1 - x_{ij}}.$$

Estimates of the item parameters, based on responses from an independent sample of 1178 persons and computed with the BILOG computer program (Mislevy & Bock, 1982), appear as Table 2.
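As a concrete check on the item response model, the following Python sketch evaluates the three-parameter logistic probabilities and the likelihood of a response pattern, using the item parameter estimates listed in Table 2. The function names `p_correct` and `pattern_likelihood` are hypothetical and chosen only for this illustration.

```python
import numpy as np

# Item parameter estimates (a, b, c) for the four items, as listed in Table 2.
A = np.array([1.27, 1.45, 2.49, 2.27])
B = np.array([-0.13, 0.42, 0.71, 0.62])
C = np.array([0.20, 0.20, 0.20, 0.20])


def p_correct(z):
    """Three-parameter logistic probability of a correct response to each item."""
    return C + (1.0 - C) / (1.0 + np.exp(-1.7 * A * (z - B)))


def pattern_likelihood(x, z):
    """p(x | z) for a 0/1 response vector x, assuming conditional independence."""
    x = np.asarray(x)
    p = p_correct(z)
    return float(np.prod(np.where(x == 1, p, 1.0 - p)))


if __name__ == "__main__":
    # Pattern 1100: items 1 and 2 correct, items 3 and 4 incorrect.
    for z in (-4.875, 0.125):
        print(z, round(pattern_likelihood([1, 1, 0, 0], z), 3))
```

For the response pattern 1100 these evaluations come out at about .026 and .168, which agrees with the likelihood column of Table 3 at the grid points -4.875 and 0.125.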



Insert Table 2 about here 



Conditional Distributions

Conditional multivariate normality under a saturated homoscedastic model is assumed so that

$$p(z \mid y_i; \gamma, \sigma) = (2\pi\sigma^2)^{-1/2} \exp[-(z - \gamma' t_i)^2 / 2\sigma^2],$$

where $t_i = (t_{i1}, t_{i2}, t_{i3}, t_{i4})'$ is a design vector associated with respondent i, taking values as follows:

$t_{i1}$ = 1;
$t_{i2}$ = .5 if white, -.5 if Black;
$t_{i3}$ = .5 if male, -.5 if female;
$t_{i4}$ = .25 if white male or Black female, -.25 if Black male or white female;

and where $\gamma = (\gamma_1, \gamma_2, \gamma_3, \gamma_4)'$ represents a constant term, an ethnicity effect, a sex effect, and an ethnicity-by-sex interaction. The common within-cell standard deviation is denoted $\sigma$. Together, $\gamma$ and $\sigma$ play the role of $\beta_2$.

Under these assumptions, the conditional likelihood of the data in Table 1 is given by

$$L = \prod_i p(x_i \mid y_i; \beta) = \prod_i \int p(x_i \mid z; \beta_1)\, p(z \mid y_i; \gamma, \sigma)\, dz.$$

Equating first derivatives of log L to zero yields likelihood equations. For $\gamma$, after simplification,

$$\hat\gamma = (T'T)^{-1} T' u, \qquad (6)$$

where $T = (t_1, \ldots, t_n)'$ and $u = (u_1, \ldots, u_n)'$ with

$$u_i = \int z\, p(z \mid x_i, y_i; \beta_1, \hat\gamma, \hat\sigma)\, dz. \qquad (7)$$

For $\sigma$,

$$\hat\sigma^2 = n^{-1} \sum_i \int (z - \hat\gamma' t_i)^2\, p(z \mid x_i, y_i; \beta_1, \hat\gamma, \hat\sigma)\, dz. \qquad (8)$$

It will be noted that $\hat\gamma$ and $\hat\sigma$ appear in the right-hand sides of (7) and (8), necessitating iterative solution. An EM solution proceeds in repeated cycles of the form

E-step: For provisional estimates $\hat\gamma^{(t)}$ and $\hat\sigma^{(t)}$, approximate the conditional density by $p(z \mid x_i, y_i; \beta_1, \hat\gamma^{(t)}, \hat\sigma^{(t)})$.

M-step: Taking this approximation as known, evaluate (6)-(8) to obtain improved estimates $\hat\gamma^{(t+1)}$ and $\hat\sigma^{(t+1)}$.




With $\beta_1$ taken as known, the only unknowns are $\gamma$ and $\sigma$, parameters of a distribution in the exponential family; convergence of the EM algorithm is thereby guaranteed (Dempster, Laird, & Rubin, 1977). Resulting estimates are

$$\hat\gamma = (-.13, .92, .13, .43)$$

and

$$\hat\sigma = .82;$$

implied cell means are

White males      .51
White females    .15
Black males     -.63
Black females   -.55
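The EM cycle described above can be sketched in Python for the saturated four-cell model. This is an illustrative reimplementation under stated assumptions, not the program used for the reported estimates: it takes the item parameters of Table 2 as known, uses the Table 1 counts with patterns ordered 0000 through 1111, replaces the integrals in (7) and (8) with sums over a 40-point grid from -4.875 to 4.875, and exploits the fact that under a saturated design the update (6) reduces to cell averages of posterior means. All names (`em`, `LIK`, `T_CELLS`, and so on) are hypothetical.

```python
import numpy as np

# Item parameters (a, b, c) from Table 2, treated as known.
A = np.array([1.27, 1.45, 2.49, 2.27])
B = np.array([-0.13, 0.42, 0.71, 0.62])
C = np.array([0.20, 0.20, 0.20, 0.20])

# Table 1 counts: rows are patterns 0000..1111 (items 1-4); columns are
# white males, white females, Black males, Black females.
COUNTS = np.array([
    [23, 20, 27, 29], [5, 8, 5, 8], [12, 14, 15, 7], [2, 2, 3, 3],
    [16, 20, 16, 14], [3, 5, 5, 5], [6, 11, 4, 6], [1, 7, 3, 0],
    [22, 23, 15, 14], [6, 8, 10, 10], [7, 9, 8, 11], [19, 6, 1, 2],
    [21, 18, 7, 19], [11, 15, 9, 5], [23, 20, 10, 8], [86, 42, 2, 4],
], dtype=float)
PATTERNS = np.array([[(k >> (3 - j)) & 1 for j in range(4)] for k in range(16)])
GRID = np.linspace(-4.875, 4.875, 40)

# Likelihood of each pattern at each grid point, shape (16, 40).
P = C + (1 - C) / (1 + np.exp(-1.7 * A * (GRID[:, None] - B)))
LIK = np.prod(np.where(PATTERNS[:, None, :] == 1, P, 1 - P), axis=2)

# Design vectors t for the four cells, in the column order of COUNTS.
T_CELLS = np.array([
    [1, 0.5, 0.5, 0.25], [1, 0.5, -0.5, -0.25],
    [1, -0.5, 0.5, -0.25], [1, -0.5, -0.5, 0.25],
])


def em(n_iter=500):
    """Grid-based EM for cell means and a common within-cell SD."""
    mu, sigma = np.zeros(4), 1.0          # arbitrary starting values
    n = COUNTS.sum()
    for _ in range(n_iter):
        # E-step: posterior over the grid for every pattern-by-cell combination.
        prior = np.exp(-0.5 * ((GRID[None, :] - mu[:, None]) / sigma) ** 2)
        prior /= prior.sum(axis=1, keepdims=True)              # (4, 40)
        post = LIK[:, None, :] * prior[None, :, :]             # (16, 4, 40)
        post /= post.sum(axis=2, keepdims=True)
        # M-step: cell means of posterior means, then pooled variance.
        post_mean = post @ GRID                                # (16, 4)
        mu = (COUNTS * post_mean).sum(axis=0) / COUNTS.sum(axis=0)
        sq = (post * (GRID[None, None, :] - mu[None, :, None]) ** 2).sum(axis=2)
        sigma = np.sqrt((COUNTS * sq).sum() / n)
    gamma = np.linalg.solve(T_CELLS, mu)   # effects implied by the cell means
    return gamma, sigma, mu


if __name__ == "__main__":
    gamma, sigma, mu = em()
    print(np.round(gamma, 2), round(float(sigma), 2), np.round(mu, 2))
```

If the sketch matches the estimation described in the text, its output should land near the reported effects, standard deviation, and cell means; small differences would be expected from the grid approximation and from rounding in the tabled item parameters.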
Generation of Pseudo Data

Let $U_1, \ldots, U_{40}$ be a grid of points from -4.875 to 4.875 in equally-spaced steps of .25. The continuous distributions given by $p(z \mid x_i, y_i; \hat\beta)$ for each respondent in the sample may be approximated by discrete distributions over a finite number of points, i.e., histograms, as follows:

$$P(U_q \mid x_i, y_i; \hat\beta) = \frac{p(U_q \mid x_i, y_i; \hat\beta)}{\sum_r p(U_r \mid x_i, y_i; \hat\beta)}.$$

Five pseudo-data sets were generated by taking five values at random from such a histogram for each respondent in a two-step procedure. In the first step of obtaining $z^*_{ir}$, a random number $t_{ir}$ from the unit interval was generated to target a block in the histogram, namely that block $k_{ir}$ such that

$$\sum_{q=1}^{k_{ir}-1} P(U_q \mid x_i, y_i; \hat\beta) < t_{ir} < \sum_{q=1}^{k_{ir}} P(U_q \mid x_i, y_i; \hat\beta).$$

In the second step, a second random number $s_{ir}$ from the unit interval was generated to specify a point in block $k_{ir}$:

$$z^*_{ir} = U_{k_{ir}} + .25\,(s_{ir} - .5).$$

Table 3 gives likelihoods, a conditional distribution, a predictive distribution, and pseudo values for a typical respondent.
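The histogram approximation and the two-step draw can be sketched as follows. This is a minimal Python illustration under stated assumptions: the item parameters are those of Table 2, the conditional distribution for the example respondent (Black female, responses 1100) is taken to be normal with the reported cell mean -.55 and standard deviation .82, and the function names are hypothetical. The random numbers will of course differ from those behind the five values listed in Table 3.

```python
import numpy as np

# Item parameters (a, b, c) from Table 2.
A = np.array([1.27, 1.45, 2.49, 2.27])
B = np.array([-0.13, 0.42, 0.71, 0.62])
C = np.array([0.20, 0.20, 0.20, 0.20])

GRID = np.linspace(-4.875, 4.875, 40)   # U_1, ..., U_40
STEP = 0.25


def pattern_likelihood(x, z_grid):
    """p(x | z) at every grid point for a 0/1 response vector x."""
    x = np.asarray(x)
    p = C + (1.0 - C) / (1.0 + np.exp(-1.7 * A * (z_grid[:, None] - B)))
    return np.prod(np.where(x == 1, p, 1.0 - p), axis=1)


def predictive_histogram(x, cell_mean, sigma):
    """Discrete approximation to p(z | x, y) over the grid."""
    prior = np.exp(-0.5 * ((GRID - cell_mean) / sigma) ** 2)
    post = pattern_likelihood(x, GRID) * prior
    return post / post.sum()


def draw_pseudo_values(hist, size, rng):
    """Two-step draw: pick a block by inverse CDF, then a point within it."""
    cum = np.cumsum(hist)
    t = rng.random(size)
    # Block k satisfies cum[k-1] <= t < cum[k]; the clip guards against rounding.
    k = np.minimum(np.searchsorted(cum, t, side="right"), len(hist) - 1)
    s = rng.random(size)
    return GRID[k] + STEP * (s - 0.5)


if __name__ == "__main__":
    hist = predictive_histogram([1, 1, 0, 0], cell_mean=-0.55, sigma=0.82)
    rng = np.random.default_rng(0)
    print(draw_pseudo_values(hist, 5, rng))
```

Drawing uniformly within the selected block, rather than returning the grid point itself, keeps the pseudo values continuous while preserving the histogram's block probabilities.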



Insert Table 3 about here 
Estimation of Marginal Distributions

As noted above, it is desired to estimate the overall mean of the population under the assumption that sampling was random within the strata defined by the cells of the demographic design, with sampling probabilities doubled for Blacks. If values of z had been observed rather than x, the estimate of the mean would have been

$$\bar z = \frac{\bar z_{11}}{3} + \frac{\bar z_{12}}{3} + \frac{\bar z_{21}}{6} + \frac{\bar z_{22}}{6},$$

where subscripts identify cells as follows:

11 = white males,
12 = white females,
21 = Black males, and
22 = Black females.

Ignoring finite population corrections, an estimate of the variance of this estimator is given by

$$\hat V(\bar z) = \frac{s^2_{11}}{9\, n_{11}} + \frac{s^2_{12}}{9\, n_{12}} + \frac{s^2_{21}}{36\, n_{21}} + \frac{s^2_{22}}{36\, n_{22}}, \qquad (9)$$

where $n_{jk}$ is the sample size in cell jk and $s_{jk}$ is the estimated standard deviation.
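The stratified mean and its variance can be checked numerically against Table 4. The Python sketch below is illustrative only. The stratum weights of 1/3, 1/3, 1/6, and 1/6 are an assumption inferred from Table 4, where they reproduce the tabled population means; the cell sizes are the Table 1 totals, and the function names are hypothetical.

```python
import numpy as np

# Stratum weights for white males, white females, Black males, Black females.
# ASSUMPTION: inferred from Table 4, whose population means they reproduce.
WEIGHTS = np.array([1 / 3, 1 / 3, 1 / 6, 1 / 6])

# Cell sample sizes (column totals of Table 1).
CELL_N = np.array([263, 228, 140, 145])


def stratified_mean(cell_means):
    """Weighted mean of the four cell means."""
    return float(WEIGHTS @ np.asarray(cell_means, dtype=float))


def stratified_variance(cell_vars, cell_n=CELL_N):
    """Variance of the weighted mean, ignoring finite population corrections."""
    cell_vars = np.asarray(cell_vars, dtype=float)
    return float(np.sum(WEIGHTS ** 2 * cell_vars / cell_n))


if __name__ == "__main__":
    # Cell means and variances for pseudo-data set 1 in Table 4.
    means_1 = [0.4840, 0.0804, -0.6161, -0.5509]
    vars_1 = [0.6928, 0.7570, 0.6054, 0.5510]
    print(round(stratified_mean(means_1), 4))
    print(round(stratified_variance(vars_1), 4))
```

With the set 1 entries of Table 4 this gives approximately -.0064 for the mean and .0009 for its variance, matching the population rows of that table.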
Table 4 gives cell means and variances as estimated from the five pseudo-data sets. The expectation of the sample mean $\bar z$, given observed data, is the average of the five pseudo-sample means, or .0407. The variance associated with this estimate is given by averaging values of (9) over pseudo-data sets, or .0009, plus the variance among the estimates $\bar z^*$, or .0008, to yield a final value of .0017.
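The arithmetic in this paragraph can be reproduced from the population rows of Table 4. The short Python sketch below is only a check on the reported figures; the function name `combine` is hypothetical, and the variance among the pseudo estimates is computed with divisor R, as in (5).

```python
import numpy as np

# Population rows of Table 4: mean and variance for each pseudo-data set.
pseudo_means = np.array([-0.0064, 0.0417, 0.0704, 0.0716, 0.0262])
pseudo_vars = np.array([0.0009, 0.0009, 0.0008, 0.0009, 0.0009])


def combine(estimates, variances):
    """Return (estimate, within, between, total) for R pseudo-data sets."""
    estimate = estimates.mean()
    within = variances.mean()                        # average sampling variance
    between = np.mean((estimates - estimate) ** 2)   # variance among estimates
    return estimate, within, between, within + between


if __name__ == "__main__":
    estimate, within, between, total = combine(pseudo_means, pseudo_vars)
    print(round(estimate, 4), round(total, 4))
```

The combined estimate is .0407 and the total variance is about .0017, in agreement with the values quoted above; the two components are each close to .0009 before rounding.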



Insert Table 4 about here 



Discussion 

A necessary requirement for consistent estimates under the approach outlined above is the correct specification of p(z|y).
When the dimensionalities of z and y are low (e.g., five latent 
variables and five collateral variables), it is possible to obtain 
a detailed nonparametric approximation of this conditional 
distribution (Mislevy, 1984). When dimensionalities of z and 
y run into the hundreds, however, as in a large-scale general- 
purpose survey such as the National Assessment of Educational 
Progress (NAEP), simplifications and computing approximations 
cannot be avoided. This section, therefore, suggests some computing 
approximations and discusses their effects on the estimation of 
statistics such as differences in subpopulation means. 
Point estimation of $\beta$. The integration over $\beta$ required in (1) to obtain p(z|x,y) can be avoided in large samples when $p(\beta \mid x, y)$ is well-determined from the data. In such cases the imprecision associated with an individual's value of z that can be attributed to variation in $p(\beta \mid x, y)$ is negligible, and one may sample values from the more tractable distribution $p(z \mid x, y; \hat\beta)$, where $\hat\beta$ represents the maximum likelihood or Bayes modal estimate of $\beta$.

Solutions can be obtained by means of a generalized EM algorithm (Dempster, Laird, & Rubin, 1977). Bock and Aitkin (1981) give procedures for solving for $\beta_1$ when $\beta_2$ is known, and Mislevy (1985) gives procedures for solving for $\beta_2$ when $\beta_1$ is known and $p(z_i \mid y_i; \beta_2)$ is MVN($\Gamma' t_i$, $\Sigma$), with $t_i$ a vector function of $y_i$ expressing the dependence of the conditional mean upon the effects $\Gamma$ of collateral variables. These presentations are readily combined to give a joint solution for $\beta_1$ and $\beta_2$. Such an integrated solution for the special case $p(z_i \mid y_i) \sim$ iid $N(\mu, \sigma)$ may be found in Rigdon and Tsutakawa (1983).

Multivariate normal conditional distributions. In principle, p(z|y) gives the distribution of the latent variables at all possible values of y. As the dimensionality of z increases, considerations of tractability make it increasingly attractive to model these conditional distributions as multivariate normal (MVN) with a common dispersion matrix. It must be emphasized that this is not the same as assuming MVN marginal distributions among the latent variables z. Indeed, as the number of collateral variables increases, and to the degree they are correlated with z, the estimated marginal distribution of z can become arbitrarily close to a true (smooth) distribution of any form.

Omission of selected interactions. Even under the assumption of conditional multivariate normality, increasing dimensionality of y rapidly overburdens available computing resources if all main effects and interactions of all orders are modeled in p(z|y). A reasonable expedient is to omit all higher level interactions (interactions of order three or higher are rare in behavioral research) and, if necessary, many second order interactions as well. If main effects only are modeled, analyses of pseudo-data sets will capture them correctly but may be in error as to interaction effects. The degree of error is reduced to an extent depending on two factors:

1. It will be recalled that for each respondent, stage 2 combines information from the estimated conditional distribution $p_0(z \mid y)$ with information from item responses via p(x|z) in order to obtain the predictive distribution p(z|x,y) from which random values are selected. Assuming p(x|z) is correctly specified, one could use the resulting pseudo-data set to obtain the empirical distribution $p'(z \mid y)$. If $p_0(z \mid y)$ has been correctly specified, $p_0(z \mid y)$ and $p'(z \mid y)$ will agree. If $p_0(z \mid y)$ has not been correctly specified, information from x will cause $p'(z \mid y)$ to differ from $p_0(z \mid y)$ in the direction of the true distribution, by an amount equal to that achieved in one EM cycle of estimation. An approximation of this amount can be obtained by applying the procedures outlined by Dempster et al. (1977, pp. 10-11) to the model that includes the omitted terms.

2. Attenuation of estimates of omitted interactions will also be ameliorated to the extent that such effects are correlated with effects that are not omitted. This follows from results on the consequences of specification errors in linear regression models. If data are generated in accordance with parameter estimates under a model that is misspecified by the omission of certain effects, subsequent analyses of these data with the correct model will yield improved estimates of all effects unless the omitted effects are uncorrelated with those not omitted.

Omission of selected collateral variables. It may be reasonable to omit nonessential variables from the conditional estimation when the total number of collateral variables is large. Statistics s* based on included variables only will not suffer from this omission; subgroup differences, for example, will be captured 100-percent if these effects were included in the conditioning. For the reasons cited above, the attenuation of statistics based on omitted variables will not be serious when each respondent provides several item responses and as the number of included collateral variables increases; subgroup differences on omitted variables, for example, will suffer negligible attenuation if included variables are chosen carefully.

Use of reduced variables. The careful choice of variables to include in the conditional estimation includes two considerations. First, effects deemed important in their own right should be explicitly modeled if possible so that statistics based on their joint distributions will suffer no attenuation at all. Examples might include key demographic effects, treatment effects, and salient interactions. Second, rather than simply omitting remaining variables it is preferable to include a few well-chosen linear combinations of remaining variables; e.g., the first four principal components, or factor scores based on the first three principal factors. Such use of reduced variables guarantees efficient use of the limited number of effects that can be modeled in recapturing to a great extent a wide range of potential statistics s*.
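One way to form such reduced variables is to compute principal component scores of the remaining collateral variables and append them to the explicitly modeled effects. The Python sketch below is a generic illustration under assumed inputs, not a procedure described in this paper: the function name `principal_component_scores`, the simulated matrix of collateral variables, and the choice of four components are all hypothetical.

```python
import numpy as np


def principal_component_scores(Y, k):
    """Scores on the first k principal components of the columns of Y.

    Y is an (n_respondents, n_variables) array of collateral variables that
    are not modeled explicitly; the returned (n_respondents, k) array can be
    used as conditioning variables in their place.
    """
    Y = np.asarray(Y, dtype=float)
    Yc = Y - Y.mean(axis=0)                       # center each variable
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Yc @ Vt[:k].T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Simulated stand-in for 12 correlated background variables.
    remaining = rng.standard_normal((200, 12)) @ rng.standard_normal((12, 12))
    scores = principal_component_scores(remaining, k=4)
    print(scores.shape)
```

In practice the variables would often be standardized before the decomposition so that no single variable dominates because of its scale; that step is left out here to keep the sketch short.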
Footnote

¹But see Spencer (1984) on bootstrapping the aforementioned procedures.
Table 1

Counts of Observed Response Patterns

  Item Response             White                 Black
  1   2   3   4        Males   Females       Males   Females
  0   0   0   0          23       20           27       29
  0   0   0   1           5        8            5        8
  0   0   1   0          12       14           15        7
  0   0   1   1           2        2            3        3
  0   1   0   0          16       20           16       14
  0   1   0   1           3        5            5        5
  0   1   1   0           6       11            4        6
  0   1   1   1           1        7            3        0
  1   0   0   0          22       23           15       14
  1   0   0   1           6        8           10       10
  1   0   1   0           7        9            8       11
  1   0   1   1          19        6            1        2
  1   1   0   0          21       18            7       19
  1   1   0   1          11       15            9        5
  1   1   1   0          23       20           10        8
  1   1   1   1          86       42            2        4

  TOTAL                 263      228          140      145
Table 2

Item Parameters

  Item       a        b        c
   1       1.27     -.13      .20
   2       1.45      .42      .20
   3       2.49      .71      .20
   4       2.27      .62      .20
Table 3

Likelihood, Conditional Density, and Prediction Density
for a Typical Respondent

Collateral variables y: Black, female        Item responses x = 1100

    U_q      p(x|U_q)    p(U_q|y)    p(U_q|x,y)
  -4.875       .026        .000         .000
  -4.625       .026        .000         .000
  -4.375       .026        .000         .000
  -4.125       .026        .000         .000
  -3.875       .026        .000         .000
  -3.625       .026        .000         .000
  -3.375       .026        .000         .000
  -3.125       .026        .001         .000
  -2.875       .026        .002         .001
  -2.625       .026        .005         .002
  -2.375       .027        .010         .003
  -2.125       .027        .029         .006
  -1.875       .028        .032         .011
  -1.625       .030        .050         .018
  -1.375       .034        .071         .028
  -1.125       .039        .092         .043
  -0.875       .049        .110         .065
  -0.625       .066        .120         .095
  -0.375       .092        .121         .134
  -0.125       .130        .113         .176
   0.125       .168        .097         .194
   0.375       .171        .073         .149
   0.625       .113        .045         .061
   0.875       .042        .022         .012
   1.125       .010        .010         .001
   1.375       .002        .005         .000
   1.625       .000        .002         .000
   1.875       .000        .001         .000
   2.125       .000        .000         .000
   2.375       .000        .000         .000
   2.625       .000        .000         .000
   2.875       .000        .000         .000
   3.125       .000        .000         .000
   3.375       .000        .000         .000
   3.625       .000        .000         .000
   3.875       .000        .000         .000
   4.125       .000        .000         .000
   4.375       .000        .000         .000
   4.625       .000        .000         .000
   4.875       .000        .000         .000

Mean and standard deviation of p(U_q|x,y): -.223, .614

Five randomly selected points: .058, .333, -.352, .009, .176
Table 4

Estimated Population and Subpopulation Means

                                            Pseudo-Data Set
                        1                2                3                4                5
Subpopulation      Mean    Var.     Mean    Var.     Mean    Var.     Mean    Var.     Mean    Var.
White males        .4840   .6928    .5276   .8158    .5461   .7547    .5403   .7359    .4964   .6825
White females      .0804   .7570    .2087   .6814    .1964   .6170    .2078   .6973    .1351   .7056
Black males       -.6161   .6054   -.6357   .6527   -.5792   .6156   -.5758   .6935   -.6178   .5573
Black females     -.5509   .5510   -.5866   .5898   -.4833   .6139   -.4911   .5220   -.4878   .6269

Population
  mean (z)        -.0064            .0417            .0704            .0716            .0262
  Var (z)          .0009            .0009            .0008            .0009            .0009